Arithmetic functions in torus and tree networks

ABSTRACT

Methods and systems for performing arithmetic functions. In accordance with a first aspect of the invention, methods and apparatus are provided, working in conjunction with software algorithms and hardware implementation of class network routing, to achieve a very significant reduction in the time required for global arithmetic operations on the torus. This leads to greater scalability of applications running on large parallel machines. The invention involves three steps in improving the efficiency and accuracy of global operations: (1) Ensuring, when necessary, that all the nodes do the global operation on the data in the same order and so obtain a unique answer, independent of roundoff error; (2) Using the topology of the torus to minimize the number of hops and the bidirectional capabilities of the network to reduce the number of time steps in the data transfer operation to an absolute minimum; and (3) Using class function routing to reduce latency in the data transfer. With the method of this invention, every single element is injected into the network only once and it will be stored and forwarded without any further software overhead. In accordance with a second aspect of the invention, methods and systems are provided to efficiently implement global arithmetic operations on a network that supports the global combining operations. The latency of such global operations is greatly reduced by using these methods.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] The present invention claims the benefit of commonly-owned, co-pending U.S. Provisional Patent Application Serial No. 60/271,124 filed Feb. 24, 2001 entitled MASSIVELY PARALLEL SUPERCOMPUTER, the whole contents and disclosure of which is expressly incorporated by reference herein as if fully set forth herein. This patent application is additionally related to the following commonly-owned, co-pending United States patent applications filed on even date herewith, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein. U.S. patent application Ser. Nos. (YOR920020027US1, YOR920020044US1 (15270)), for “Class Networking Routing”; U.S. patent application Ser. No. (YOR920020028US1 (15271)), for “A Global Tree Network for Computing Structures”; U.S. patent application Ser. No. (YOR920020029US1 (15272)), for “Global Interrupt and Barrier Networks”; U.S. patent application Ser. No. (YOR920020030US1 (15273)), for “Optimized Scalable Network Switch”; U.S. patent application Ser. Nos. (YOR920020031US1, YOR920020032US1 (15258)), for “Arithmetic Functions in Torus and Tree Networks”; U.S. patent application Ser. Nos. (YOR920020033US1, YOR920020034US1 (15259)), for “Data Capture Technique for High Speed Signaling”; U.S. patent application Ser. No. (YOR920020035US1 (15260)), for “Managing Coherence Via Put/Get Windows”; U.S. patent application Ser. Nos. (YOR920020036US1, YOR920020037US1 (15261)), for “Low Latency Memory Access And Synchronization”; U.S. patent application Ser. No. (YOR920020038US1 (15276)), for “Twin-Tailed Fail-Over for Fileservers Maintaining Full Performance in the Presence of Failure”; U.S. patent application Ser. No. (YOR920020039US1 (15277)), for “Fault Isolation Through No-Overhead Link Level Checksums”; U.S. patent application Ser. No. (YOR920020040US1 (15278)), for “Ethernet Addressing Via Physical Location for Massively Parallel Systems”; U.S. patent application Ser. No. (YOR920020041US1 (15274)), for “Fault Tolerance in a Supercomputer Through Dynamic Repartitioning”; U.S. patent application Ser. No. (YOR920020042US1 (15279)), for “Checkpointing Filesystem”; U.S. patent application Ser. No. (YOR920020043US1 (15262)), for “Efficient Implementation of Multidimensional Fast Fourier Transform on a Distributed-Memory Parallel Multi-Node Computer”; U.S. patent application Ser. No. (YOR9-20010211US2 (15275)), for “A Novel Massively Parallel Supercomputer”; and U.S. patent application Ser. No. (YOR920020045US1 (15263)), for “Smart Fan Modules and System”.

BACKGROUND ART

[0002] Provisional patent application No. 60/271,124, titled “A Novel Massively Parallel SuperComputer” describes a computer comprised of many computing nodes and a smaller number of I/O nodes. These nodes are connected by several networks. In particular, these nodes are interconnected by both a torus network and by a dual functional tree network. This torus network may be used in a number of ways to improve the efficiency of the computer.

[0003] To elaborate, on a machine which has a large enough number of nodes and with a network that has the connectivity of an M-dimensional torus, the usual way to do a global operation is by means of shift and operate. For example, to do a global sum (MPI_SUM) over all nodes, after each computer node has done its own local partial sum, each node first sends the local sum to its plus neighbor along one dimension and then adds the number it itself received from its neighbor to its own sum. Second, it passes the number it received from its minus neighbor to its plus neighbor, and again adds the number it receives to its own sum. Repeating the second step (N−1) times (where N is the number of nodes along this one dimension) followed by repeating the whole sequence over all dimensions one at a time, yields the desired results on all nodes. However, for floating point numbers, because the order of the floating point sums performed at each node is different, each node will end up with a slightly different result due to roundoff effects. This will cause a problem if some global decision is to be made which depends on the value of the global sum. In many cases, this problem is avoided by picking a special node which will first gather data from all the other nodes, do the whole computation and then broadcast the sum to all nodes. However, when the number of nodes is sufficiently large, this method is slower than the shift and operate method.

[0004] In addition, as indicated above, in the computer disclosed in provisional patent application No. 60/271,124, the nodes are also connected by a dual-functional tree network that supports integer combining operations, such as integer sums and integer maximums (max) and minimums (min). The existence of a global combining network opens up possibilities to efficiently implement global arithmetic operations over this network, for example, adding up floating point numbers from each of the computing nodes and broadcasting the sum to all participating nodes. On a regular parallel supercomputer, these kinds of operations are usually done over the network that carries the normal message-passing traffic. There is usually high latency associated with such kinds of global operations.

SUMMARY OF THE INVENTION

[0005] An object of this invention is to improve procedures for computing global values for global operations on a distributed parallel computer.

[0006] Another object of the present invention is to compute a unique global value for a global operation using the shift and operate method in a highly efficient way on distributed parallel M-torus architectures with a large number of nodes.

[0007] A further object of the invention is to provide a method and apparatus, working in conjunction with software algorithms and hardware implementations of class network routing, to achieve a very significant reduction in the time required for global arithmetic operations on a torus architecture.

[0008] Another object of this invention is to efficiently implement global arithmetic operations on a network that supports global combining operations.

[0009] A further objective of the invention is to implement global arithmetic operations to generate binary reproducible results.

[0010] An object of the present invention is to provide an improved procedure for conducting a global sum operation.

[0011] A further object of the invention is to provide an improved procedure for conducting a global all gather operation.

[0012] These and other objectives are attained with the below described methods and systems for performing arithmetic functions. In accordance with a first aspect of the invention, methods and apparatus are provided, working in conjunction with software algorithms and hardware implementation of class network routing, to achieve a very significant reduction in the time required for global arithmetic operations on the torus. This leads to greater scalability of applications running on large parallel machines. The invention involves three steps for improving the efficiency, accuracy and exact reproducibility of global operations:

[0013] 1. Ensuring, when necessary, that all the nodes do the global operation on the data in the same order and so obtain a unique answer, independent of roundoff error.

[0014] 2. Using the topology of the torus to minimize the number of hops and the bidirectional capabilities of the network to reduce the number of time steps in the data transfer operation to an absolute minimum.

[0015] 3. Using class function routing (patent application Ser. No. ______ (Attorney Docket 15270)) to reduce latency in the data transfer. With the method of this invention, every single element is injected into the network only once and it will be stored and forwarded without any further software overhead.

[0016] In accordance with a second aspect of the invention, methods and systems are provided to efficiently implement global arithmetic operations on a network that supports the global combining operations. The latency of such global operations is greatly reduced by using these methods. In particular, with a combining tree network that supports integer maximum MAX, addition SUM, and bitwise AND, OR, and XOR, one can implement virtually all predefined global reduce operations in MPI (Message-Passing Interface Standard): MPI_SUM, MPI_MAX, MPI_MIN, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, and MPI_MINLOC plus MPI_ALLGATHER over this network. The implementations are easy and efficient, demonstrating the great flexibility and efficiency a combining tree network brings to a large scale parallel supercomputer.

[0017] Further benefits and advantages of the invention will become apparent from a consideration of the following detailed description, given with reference to the accompanying drawings, which specify and show preferred embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

[0018] FIG. 1 schematically represents a torus network that connects the nodes of a computer. The wrapped network connections are not shown.

[0019] FIG. 2 schematically represents a tree network that connects the nodes of a computer.

[0020] FIG. 3 illustrates a procedure for performing a global sum on a one-dimensional torus.

[0021] FIG. 4 is a table identifying steps that can be used to improve the efficiency of global arithmetic operations on a torus architecture.

[0022] FIG. 5 illustrates the operation of global sum on a dual-functional tree network.

[0023] FIG. 6 illustrates the operation of global all gathering on a dual-functional tree network.

[0024] FIG. 7 illustrates a 3 by 4 torus network.

[0025] FIG. 8 illustrates a tree network for doing a final broadcast operation.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0026] The present invention relates to performing arithmetic functions in a computer, and one suitable computer is disclosed in provisional patent application No. 60/271,124. This computer is comprised of many computing nodes and a smaller number of I/O nodes; and the nodes of this computer are interconnected by both a torus network, schematically represented at 10 in FIG. 1, and a dual functional tree network, schematically represented at 20 in FIG. 2.

[0027] More specifically, one aspect of the present invention provides a method and apparatus working in conjunction with software algorithms and hardware implementation of class network routing, to achieve a very significant reduction in the time required for global arithmetic operations on the torus architecture. This leads to greater scalability of applications running on large parallel machines. As illustrated in FIG. 3, the invention involves three steps in improving the efficiency and accuracy of global operations:

[0028] 1. Ensuring, when necessary, that all the nodes do the global operation on the data in the same order, and so obtain a unique answer, independent of roundoff error.

[0029] 2. Using the topology of the torus to minimize the number of hops and the bidirectional capabilities of the network to reduce the number of time steps in the data transfer operation to an absolute minimum.

[0030] 3. Using class function routing to reduce latency in the data transfer. With the preferred method of this invention, every single element is injected into the network only once and it will be stored and forwarded without any further software overhead.

[0031] Each of these steps is discussed below in detail.

[0032] 1. Ensuring that all nodes do the global operation (e.g., MPI_SUM) in the same order:

[0033] When doing the one-dimensional shift and addition of the local partial sums, instead of adding numbers when they come in, each node will keep the N−1 numbers received for each direction. The global operation is performed on the numbers after they have all been received so that the operation is done in a fixed order and results in a unique result on all nodes.

[0034] For example, as illustrated in FIG. 4, if each node adds numbers as they are received, the sums computed would be S0+S3+S2+S1 on node0, S1+S0+S3+S2 on node1, S2+S1+S0+S3 on node2, and S3+S2+S1+S0 on node3. There can be roundoff differences between these sums. However, if the numbers that every node receives are kept in memory, then all the nodes can do the sum in the same order to obtain S0+S1+S2+S3 and there would be no roundoff difference.

[0035] This is repeated for all the other dimensions. At the end, all the nodes will have the same number and the final broadcast is unnecessary.
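
By way of a non-limiting illustration, the following Python sketch simulates the four-node, one-dimensional case described above. It contrasts adding numbers as they arrive with storing them and adding in a fixed index order. The function names and the sample values are assumptions chosen for the illustration (the values are picked so that roundoff is visible); the sketch models only the order of additions, not the network itself.

```python
def arrival_order_sums(local):
    """Each node adds values in the order they arrive on the ring:
    its own value first, then the values of its minus neighbors in turn."""
    n = len(local)
    sums = []
    for node in range(n):
        total = local[node]
        for step in range(1, n):
            total += local[(node - step) % n]
        sums.append(total)
    return sums


def fixed_order_sums(local):
    """Each node keeps every received value in memory, indexed by the
    originating node, and adds them in index order S0+S1+...+S(n-1)."""
    n = len(local)
    sums = []
    for node in range(n):
        received = {node: local[node]}
        for step in range(1, n):
            src = (node - step) % n
            received[src] = local[src]
        total = 0.0
        for idx in range(n):
            total += received[idx]
        sums.append(total)
    return sums


# Hypothetical local partial sums S0..S3 with very different magnitudes.
partial = [1e16, 1.0, -1e16, 1.0]
by_arrival = arrival_order_sums(partial)
by_fixed_order = fixed_order_sums(partial)
```

With these sample values the arrival-order results are not the same on every node, whereas the fixed-order results are bit-for-bit identical, so no final broadcast is needed to make the nodes agree.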

[0036] 2. Minimizing the number of steps in the data transfer on an M-torus architecture:

[0037] On any machine where the network links between two neighboring nodes are bidirectional, we can send data in both directions in each of the steps. This will mean that the total distance each data element has to travel on the network is reduced by a factor of two. This reduces the time for doing global arithmetic on the torus also by almost a factor of two.
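
As a rough illustration of this factor of two, the sketch below counts the largest number of hops any data element must travel on a ring of n nodes when data moves in one direction only and when it moves in both directions. The function name `ring_steps` is an assumption introduced for this example, and the count ignores bandwidth and software overhead.

```python
def ring_steps(n, bidirectional):
    """Largest hop count needed for one node's value to reach every other
    node on an n-node ring (one dimension of the torus)."""
    hops = []
    for distance in range(1, n):
        if bidirectional:
            # The value may travel whichever way around the ring is shorter.
            hops.append(min(distance, n - distance))
        else:
            hops.append(distance)
    return max(hops, default=0)
```

For a ring of eight nodes this gives seven steps in one direction and four steps when both directions are used.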

[0038] 3. Reducing the latency using class function routing:

[0039] Additional performance gains can be achieved by including a store and forward class network routing operation in the network hardware, thereby eliminating the software overhead of extracting and injecting the same data element multiple times into the network. When implementing global arithmetic operations on a network capable of class routing, steps 1 to 3 illustrated in FIG. 4 will simply become a single network step; i.e., each node will need to inject a number only once and every other node will automatically forward this number along to all the other nodes that need it, while keeping a copy for its own use. This will greatly reduce the latency of the global operation. Instead of paying software overhead every hop on the torus, one will pay only a single overhead cost per dimension of the machine. For example, on the computer system disclosed in Provisional Application No. 60/271,124, we estimate that there will be an improvement of at least a factor of five when the CPU is running in single user mode and more than a factor of ten in multi-user mode.

[0040] With the three improvement steps discussed above, one can achieve at least a factor of ten improvement for global arithmetic operations on distributed parallel architectures and also greatly improve the scalability of applications on a large parallel machine.

[0041] In addition, as previously mentioned, in the computer system disclosed in the above-identified provisional application, the nodes are also connected by a tree network that supports data combining operations, such as integer sums and integer maximums and minimums, bitwise AND, OR and XORs. In addition, the tree network will automatically broadcast the final combined result to all participating nodes. With a computing network supporting global combining operations, many of the global communication patterns can be efficiently supported by this network. By far the simplest requirement for the combining network hardware is to support unsigned integer add and unsigned integer maximum up to certain precision. For example, the supercomputer disclosed in the above-identified provisional patent application will support at least 32 bit, 64 bit and 128 bit unsigned integer sums or maximums, plus a very long precision sum or maximum up to the 2048 bit packet size. The combining functions in the network hardware provide great flexibility in implementing high performance global arithmetic functions. A number of examples of these implementations are presented below.

[0042] 1. Global Sum of Signed Integers

[0043] FIG. 5 shows the operation of global sum. Each participating node has an equal size array of numbers with the same number of array elements. The result of the global sum operation is that each node will have the sum of the corresponding elements of arrays from all nodes. This relates to the MPI_SUM function in the MPI (Message Passing Interface) standard.

[0044] It is usually necessary to use a higher precision in the network compared to each local number to maintain the precision of the final result. Let N be the number of nodes participating in the sum, M be the largest absolute value of the integer numbers to be summed, and 2^P be a large positive integer number greater than M. To implement signed integer sum in a network that supports the unsigned operation, we only need to

[0045] (1) Add the large positive integer 2^P to all the numbers to be summed so that they now become non-negative.

[0046] (2) Do the unsigned integer sum over the network.

[0047] (3) Subtract (N*2^P) from the result.

[0048] P is chosen so that 2^P>M, and (N*2^(P+1)) will not overflow in the combining network.
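
The three steps above can be sketched in Python as follows. This is a minimal software model only: the function name `global_signed_sum`, the `width` parameter and the loop standing in for the combining network are assumptions made for the illustration, not a description of the hardware.

```python
def global_signed_sum(values, p, width=64):
    """Signed integer sum using only an unsigned sum of `width` bits.
    `values` holds one signed integer per participating node."""
    n = len(values)
    bias = 1 << p  # the large positive integer 2^P
    # P must satisfy 2^P > M and N*2^(P+1) must fit in the network width.
    assert all(abs(v) < bias for v in values)
    assert n * (1 << (p + 1)) < (1 << width)

    # Step (1): add 2^P so every operand becomes non-negative.
    biased = [v + bias for v in values]

    # Step (2): unsigned integer sum (stand-in for the combining network).
    total = 0
    for operand in biased:
        total = (total + operand) % (1 << width)

    # Step (3): subtract N*2^P to recover the signed result.
    return total - n * bias
```

For instance, with one value per node taken from the hypothetical list -5, 3, -7, 12 and P=8, the function returns 3 on every node.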

[0049] 2. Global Max and Min of Signed Integers

[0050] These operations are very similar to the global sum discussed above, except that the final result is not the sum of the corresponding elements but the maximum or the minimum one. They relate to MPI_MAX and MPI_MIN functions in the MPI standard with the integer inputs. The implementation of global max is very similar to the implementation of the global sum, as discussed above.

[0051] (1) Add a large positive integer 2^P to all numbers to make them non-negative.

[0052] (2) Do the unsigned global max over the network.

[0053] (3) Subtract 2^P from the result.

[0054] To do a global min, just negate all the numbers, do a global max, and negate the result.
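
A minimal Python model of these steps is given below. The function names are assumptions for the illustration, and Python's built-in `max` stands in for the unsigned global max performed by the combining network.

```python
def global_signed_max(values, p):
    """Signed integer max using only an unsigned max."""
    bias = 1 << p  # 2^P, larger than every |value|
    assert all(abs(v) < bias for v in values)
    biased = [v + bias for v in values]   # step (1): make non-negative
    combined = max(biased)                # step (2): unsigned global max
    return combined - bias                # step (3): remove the bias


def global_signed_min(values, p):
    """Global min obtained by negating the inputs, taking the global max,
    and negating the result."""
    return -global_signed_max([-v for v in values], p)
```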

[0055] 3. Global Sum of Floating Point Numbers:

[0056] The operation of the global sum of floating point numbers is very similar to the earlier discussed integer sums except that now the inputs are floating point numbers. For simplicity, we will demonstrate summing of one number from each node. To do an array sum, just repeat the steps.

[0057] The basic idea is to do two round-trips on the combining network.

[0058] (1) Find the integer maximum of the exponent of all numbers, Emax, using the steps outlined in the discussion of the Global max.

[0059] (2) Each node will then normalize its local number and convert it to an integer. Let the local number on node “i” be X_i, whose exponent is E_i. Using the notation defined in the description of the Global sum, this conversion corresponds to the calculation,

A_i = 2^P + 2^[P−(Emax−E_i)−1] * X_i,   [Eq. 1]

[0060] where A_i is an unsigned integer. A global unsigned integer sum can then be performed on the network using the combining hardware. Once the final sum A has arrived at each node, the true sum S can be obtained on each node locally by calculating

S = (A − N*2^P) / 2^(P−1).

[0061] Again, P is chosen so that N*2^(P+1) will not overflow in the combining network.

[0062] It should be noted that the step done in equation (1) above is achieved with the best possible precision by using a microprocessor's floating point unit to convert negative numbers to positive and then by using its integer unit to do proper shifting.

[0063] One important feature of this floating point sum algorithm is that because the actual sum is done through an integer sum, there is no dependence on how the order of the sum is carried out. Each participating node will get the exact same number after the global sum. No additional broadcast from a special node is necessary, which is usually the case when the floating point global sum is implemented through the normal message passing network.
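
The two round trips can be modelled in Python as sketched below. Several points are assumptions made for this illustration rather than statements about the disclosed hardware: X_i is taken to be the normalized mantissa returned by `math.frexp` (magnitude below one), the scaled value is truncated toward zero when converted to an integer, the result is rescaled by 2^Emax at the end to return to the original units, and the names `global_float_sum`, `p` and `width` are hypothetical.

```python
import math


def global_float_sum(values, p=60, width=128):
    """Floating point sum carried out as an unsigned integer sum.
    `values` holds one floating point number per participating node."""
    n = len(values)
    bias = 1 << p
    assert n * (1 << (p + 1)) < (1 << width)

    # Round trip 1: global max of the exponents (stand-in for the network).
    emax = max(math.frexp(v)[1] for v in values)

    # Local conversion on each node, following the form of Eq. 1.
    operands = []
    for v in values:
        mantissa, exponent = math.frexp(v)  # v == mantissa * 2**exponent
        shift = p - (emax - exponent) - 1
        scaled = int(math.ldexp(mantissa, shift))  # truncates toward zero
        operands.append(bias + scaled)             # non-negative integer

    # Round trip 2: global unsigned integer sum (stand-in for the network).
    total = sum(operands)

    # Local recovery: remove the bias, undo the 2^(P-1) scaling, and
    # restore the common exponent Emax.
    return math.ldexp(total - n * bias, emax - (p - 1))
```

Because the combining step is an integer sum, the result does not depend on the order of the operands; calling the function on a reversed list returns the same bits.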

[0064] Those skilled in the art will recognize that even if the network hardware supports only unsigned integer sums, when integers are represented in 2's complement format, correct sums will be obtained as long as no overflow occurs on any final and intermediate results and the carry bit of the sum over any two numbers is dropped by the hardware. The simplification of the operational steps to the global integer and floating point sums comes within the scope of the invention, as well as when the network hardware directly supports signed integer sums with correct overflow handling.

[0065] For example, when the hardware only supports unsigned integer sums and drops all carry bits from unsigned integer overflow, such as implemented on the supercomputer disclosed in provisional patent application No. 60/271,124, the simplified signed integer sum steps could be:

[0066] (1) sign extend each integer to a higher precision to ensure no overflow of any results would occur; i.e., pad 0 to all the extended high order bits for positive integers and zero, and pad 1 to all the extended high order bits for negative integers.

[0067] (2) do the sum over the network. The final result will have the correct sign.

[0068] The above can also be applied to the summing step of floating point sums.
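
As an illustration of the simplified steps, the following Python sketch sign extends each operand into a wider two's complement word, adds the words as unsigned integers while discarding any carry out of the top bit, and reads the result back as a signed number. The helper names and the default 64 bit width are assumptions for this example.

```python
def sign_extend_to_unsigned(value, wide_bits):
    """Two's complement encoding of a signed integer in `wide_bits` bits:
    high order bits are 0 for non-negative values and 1 for negative ones."""
    return value & ((1 << wide_bits) - 1)


def wrapping_unsigned_sum(words, wide_bits):
    """Unsigned sum in which the carry out of the top bit is dropped."""
    mask = (1 << wide_bits) - 1
    total = 0
    for word in words:
        total = (total + word) & mask
    return total


def as_signed(word, wide_bits):
    """Interpret an unsigned word as a two's complement signed integer."""
    if word >> (wide_bits - 1):
        return word - (1 << wide_bits)
    return word


def simplified_signed_sum(values, wide_bits=64):
    words = [sign_extend_to_unsigned(v, wide_bits) for v in values]
    return as_signed(wrapping_unsigned_sum(words, wide_bits), wide_bits)
```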

[0069] With a similar modification from the description of the Global Sum of integers to the description of the Global max, floating point max and min can also easily be obtained.

[0070] There is also a special case for floating point max of non-negative numbers, in which the operation can be accomplished in one round trip instead of two. For numbers using the IEEE 754 Standard for Floating Point Binary Arithmetic format, as in most of the modern microprocessors, no additional local operations are required. With proper byte ordering, each node can just put the numbers on the combining network. For other floating point formats, like those used in some Digital Signal Processors, some local manipulation of the exponent field may be required. The same single round-trip can also be achieved for the min of negative numbers by doing a global max on their absolute values.
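
The single round trip works because, for non-negative IEEE 754 numbers, the bit pattern read as an unsigned integer (most significant byte first) grows with the value. The Python sketch below models this for 64 bit doubles; the function names are assumptions for the illustration, and the built-in `max` stands in for the unsigned global max on the network.

```python
import math
import struct


def float_to_ordered_uint(x):
    """Big-endian bit pattern of a non-negative double as an unsigned int."""
    if math.isnan(x) or math.copysign(1.0, x) < 0:
        raise ValueError("only non-negative, non-NaN inputs are supported")
    return int.from_bytes(struct.pack(">d", x), "big")


def uint_to_float(u):
    """Inverse of float_to_ordered_uint."""
    return struct.unpack(">d", u.to_bytes(8, "big"))[0]


def global_nonneg_float_max(values):
    """Floating point max of non-negative numbers via one unsigned max."""
    combined = max(float_to_ordered_uint(v) for v in values)
    return uint_to_float(combined)
```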

[0071] 4. Global All Gather Operation Using Integer Global Sum

[0072] The global all gather operation is illustrated in FIG. 6. Each node contributes one or more numbers. The final result is that these numbers are put into an array with their location corresponding to where they came from. For example, numbers from node 1 appear first in the final array, followed by numbers from node 2, . . . , etc. This operation corresponds to the MPI_ALLGATHER function in the MPI standard.

[0073] This function can be easily implemented in a one pass operation on a combining network supporting integer sums. Using the fact that adding zero to a number yields the same number, each node simply needs to assemble an array whose size equals the final array, and then it will put its numbers in the corresponding place and put zero in all other places corresponding to numbers from all other nodes. After an integer sum of arrays from all nodes over the combining network, each node will have the final array with all the numbers sorted into their places.
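
A small Python model of this one pass operation follows. It assumes, for the sake of the example, that every node contributes the same number of integers; the function name `all_gather_by_sum` is hypothetical, and the element-wise `sum` stands in for the integer sum on the combining network.

```python
def all_gather_by_sum(contributions):
    """contributions[j] is the list of integers held by node j.
    Returns the gathered array that every node ends up with."""
    n = len(contributions)
    k = len(contributions[0])
    assert all(len(c) == k for c in contributions)

    # Each node builds a full-size array: its own numbers in its own slot,
    # zeros everywhere else.
    padded = []
    for node, numbers in enumerate(contributions):
        array = [0] * (n * k)
        array[node * k:(node + 1) * k] = numbers
        padded.append(array)

    # Element-wise integer sum of the arrays from all nodes.
    return [sum(column) for column in zip(*padded)]
```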

[0074] 5. Global Min_Loc and Max_Loc, Using Integer Global Max

[0075] These functions correspond to MPI_MINLOC and MPI_MAXLOC in the MPI standard. Besides finding the global minimum or maximum, an index is appended to each of the numbers so that one could find out which node has the global minimum or maximum, for example.

[0076] On a combining network that supports integer global max, these functions are straightforward to implement. We will illustrate global max_loc as an example. Let node “j”, j=1, . . . , N, have number X_j and index K_j. Let M be a large integer number with M>max(K_j). Node “j” only needs to put two numbers:

[0077] X_j

[0078] M−K_j

[0079] as a single unit onto the combining network for global integer max. At the end of the operation, each node would receive:

[0080] X

[0081] M−K

[0082] where X=max(X_j) is the maximum value of all X_j's, and K is the index number that corresponds to the maximum X. If there is more than one number equal to the maximum X, then K is the lowest index number.
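
The packing of X_j and M−K_j into a single unit can be sketched in Python as below. The layout (value in the high order bits, M−K_j in the low order `index_bits` bits), the restriction to non-negative X_j, and the names used are assumptions for this illustration; `max` stands in for the integer global max on the network.

```python
def global_max_loc(values, indices, index_bits=16):
    """Returns (X, K): the maximum value and the lowest index holding it."""
    m = (1 << index_bits) - 1  # M, chosen larger than every index K_j
    assert all(v >= 0 for v in values)
    assert all(0 <= k < m for k in indices)

    # Each node forms one unit: X_j in the high bits, M - K_j in the low bits.
    packed = [(x << index_bits) | (m - k) for x, k in zip(values, indices)]

    # Integer global max over the packed units.
    best = max(packed)

    # Every node unpacks the same result locally.
    x = best >> index_bits
    k = m - (best & ((1 << index_bits) - 1))
    return x, k
```

When two nodes hold the same maximum, the unit with the larger M−K_j wins the comparison, which is the one with the lower index, in agreement with the tie rule stated above.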

[0083] Global min_loc can be achieved similarly by changing X_j to P−X_j in the above where P is a large positive integer number and P>max(X_j).

[0084] The idea of appending the index number behind the number in the global max or min operation also applies to floating point numbers, with steps similar to those described above in the discussion of the procedure for performing the global sum of floating point numbers.

[0085] 6. Other Operations:

[0086] On the supercomputer system described in the provisional patent application No. 60/271,124, additional global bitwise AND, OR, and XOR operations are also supported on the combining network. This allows for very easy implementation of global bitwise reduction operations, such as MPI_BAND, MPI_BOR and MPI_BXOR in the MPI standard. Basically, each node just needs to put the operand for the global operation onto the network, and the global operations are handled automatically by the hardware.

[0087] In addition, logical operations MPI_LAND, MPI_LOR and MPI_LXOR can also be implemented by just using one bit in the bitwise operations.
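
For illustration, the sketch below reduces each node's logical operand to a single bit and then applies the corresponding bitwise operation. The function name `logical_reduce` is an assumption, and `functools.reduce` stands in for the combining network.

```python
from functools import reduce
from operator import and_, or_, xor


def logical_reduce(operands, bitwise_op):
    """Logical AND/OR/XOR across nodes using one bit per node."""
    bits = [1 if operand else 0 for operand in operands]
    return bool(reduce(bitwise_op, bits))


# One hypothetical operand per node; any non-zero value counts as true.
all_true = logical_reduce([1, 5, 2], and_)
any_true = logical_reduce([0, 0, 3], or_)
odd_count = logical_reduce([1, 1, 0], xor)
```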

[0088] Finally, each of the global operations also implies a global barrier operation. This is because the network will not proceed until all operands are injected into the network. Therefore, efficient MPI_BARRIER operations can also be implemented using any one of the global arithmetic operations, such as the global bitwise AND.

[0089] 7. Operations Using Both the Torus and Tree Networks:

[0090] Depending on the relative bandwidths of the torus and tree networks, and on the overhead to do the necessary conversions between floating and fixed point representations, it may be more efficient to use both the torus and tree networks simultaneously to do global floating point reduction operations. In such a case, the torus is used to do the reduction operation, and the tree is used to broadcast the results to all nodes. Prior art methods for doing reductions on a torus are known. However, in prior art, the broadcast phase is also done on the torus. For example, in a 3 by 4 torus (or mesh) as illustrated at 30 in FIG. 7, reductions are done down rows, and then down columns by nodes at the ends of the rows. In particular, in a sum reduction, FIG. 7 depicts node Q20 inserting a packet and sending it to node Q21. Q21 processes this packet by adding its corresponding elements to those of the incoming packet and then sending a packet containing the sum to Q22. Q22 processes this packet by adding its corresponding elements to those of the incoming packet and then sending a packet containing the sum to Q23. These operations are repeated for each row. Node Q23 sums its local values to the corresponding values in the packet from Q22 and sends the resulting packet to Q13. Node Q13 sums its local values to those of the packets from Q12 and Q23, and sends the resulting sum to Q03. Q03 sums its local values to the corresponding values of the packet from Q13. Q03 now has the global sum. In prior art, this global sum is sent over the torus network to all the other nodes (rather than on the tree as shown in the figure). The extension to more nodes and to a higher dimensional torus is within the ability of those skilled in the art and within the scope of the present invention. For reductions over a large number of values, multiple packets are used in a pipelined fashion.
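
The row-then-column reduction pattern described for the 3 by 4 example can be modelled in Python as follows. This sketch only follows the order in which packets are combined; the function name, the representation of a packet as a list of numbers, and the omission of the final broadcast are assumptions for the illustration.

```python
def torus_row_column_reduce(grid):
    """grid[r][c] is the list of local values held by node Q{r}{c}.
    Returns the packet of global sums held by node Q0(last column)."""
    rows, cols = len(grid), len(grid[0])

    # Reduce along each row: the packet starts at column 0 and each node
    # adds its corresponding elements before passing the packet on.
    row_totals = []
    for r in range(rows):
        packet = list(grid[r][0])
        for c in range(1, cols):
            packet = [a + b for a, b in zip(packet, grid[r][c])]
        row_totals.append(packet)  # held by the node at the end of row r

    # Reduce along the last column, from the last row toward row 0.
    packet = row_totals[rows - 1]
    for r in range(rows - 2, -1, -1):
        packet = [a + b for a, b in zip(packet, row_totals[r])]
    return packet


# Hypothetical 3 by 4 example: node Q{r}{c} holds the values [4*r + c, 1].
example_grid = [[[4 * r + c, 1] for c in range(4)] for r in range(3)]
global_packet = torus_row_column_reduce(example_grid)
```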

[0091] However, the final broadcast operation can be done faster and more efficiently by using the tree, rather than the torus, network. This is illustrated in FIG. 8. Performance can be optimized by having one processor handle the reduction operations on the torus and a second processor handle the reception of the packets broadcast on the tree. Performance can further be optimized by reducing the number of hops in the reduction step. For example, packets could be sent (and summed) to the middle of the rows, rather than to the end of the rows.

[0092] In a 3-dimensional torus, the straightforward extension of the above results in a single node in each z plane summing their values up the z dimension. This has the disadvantage of requiring those nodes to process three incoming packets. For example, node Q03z has to receive packets from Q02z, Q13z, and Q03(z+1). If the processor is not fast enough this will become the bottleneck in the operation. To optimize performance, we modify the communications pattern so that no node is required to process more than 2 incoming packets on the torus. This is illustrated in FIG. 8. In this Figure, node Q03z forwards its packets to node Q00z for summing down the z-dimension. In addition, node Q00z does not send its packets to node Q01z but rather receives a packet from node Q00(z+1) and sums its local values with the corresponding values of its two incoming packets. Finally, node Q000 broadcasts the final sum over the tree network.

[0093] While it is apparent that the invention herein disclosed is well calculated to fulfill the objects stated above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art, and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.

What is claimed is:
 1. A method of performing arithmetic functions,using a shift and operate procedure, in a computer system having adistributed parallel torus architecture with a multitude ofinterconnected nodes, the method comprising the steps: providing each ofa group of the nodes with the same set of data values; performing aglobal arithmetic operation, wherein each of the nodes performs thearithmetic operation on all of the data values to obtain a final value;and ensuring that all of the nodes of the group perform said globaloperation on the data values in the same order to ensure binaryreproducible results.
 2. A method according to claim 1, wherein theensuring step includes the step of, each node performing the globalarithmetic operation after the node is provided with all of the datavalues.
 3. A method according to claim 2, wherein the providing stepincludes the step of each node of the group receiving the data valuesfrom other nodes of the group.
 4. A method according to claim 1, whereinthe nodes are connected together by bidirectional links, and theproviding step includes the step of sending the data values to the nodesin two directions over said links.
 5. A method according to claim 1,wherein the providing step includes the step of each one of the nodesinjecting one of the data values into the network only once.
 6. A methodaccording to claim 5, wherein the injecting step includes the step of,nodes of the group of nodes, other than said each one of the nodes,forwarding said one of the data values to other nodes of the group toreduce the latency of the global operation.
 7. A system for performing arithmetic functions, using a shift and operate procedure, in a computer system having a distributed parallel torus architecture with a multitude of interconnected nodes, the system comprising: a group of the nodes provided with the same set of data values; means for performing a global arithmetic operation, wherein each of the nodes performs the arithmetic operation on all of the data values to obtain a final value; and means for ensuring that all of the nodes of the group perform said global operation on the data values in the same order to ensure binary reproducible results.
 8. A system according to claim 7, wherein the ensuring means includes means for performing the global arithmetic operation at each node after the node is provided with all of the data values.
 9. A system according to claim 7, wherein each node of the group receives the data values from other nodes of the group.
 10. A system according to claim 7, wherein the nodes are connected together by bidirectional links, and the providing means includes means for sending the data values to the nodes in two directions over said links.
 11. A system according to claim 7, wherein each one of the nodes injects one of the data values into the network only once.
 12. A system according to claim 7, wherein nodes of the group of nodes, other than said each one of the nodes, forward said one of the data values to other nodes of the group to reduce the latency of the global operation.
 13. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for performing arithmetic functions, using a shift and operate procedure, in a computer system having a distributed parallel torus architecture with a multitude of interconnected nodes, the method steps comprising: providing each of a group of the nodes with the same set of data values; performing a global arithmetic operation, wherein each of the nodes performs the arithmetic operation on all of the data values to obtain a final value; and ensuring that all of the nodes of the group perform said global operation on the data values in the same order to ensure binary reproducible results.
 14. A program storage device according to claim 13, wherein the ensuring step includes the step of, each node performing the global arithmetic operation after the node is provided with all of the data values.
 15. A program storage device according to claim 14, wherein the providing step includes the step of each node of the group receiving the data values from other nodes of the group.
 16. A program storage device according to claim 13, wherein the nodes are connected together by bi-directional links, and the providing step includes the step of sending the data values to the nodes in two directions over said links.
 17. A program storage device according to claim 13, wherein the providing step includes the step of each one of the nodes injecting one of the data values into the network only once.
 18. A program storage device according to claim 17, wherein the injecting step includes the step of, nodes of the group of nodes, other than said each one of the nodes, forwarding said one of the data values to other nodes of the group to reduce the latency of the global operation.
 19. A method of performing an arithmetic function in a computer system having a multitude of nodes interconnected by a global tree network that supports integer combining operations, the method comprising the steps of: providing each of a group of nodes with first values; processing each of the first values, according to a first defined process, to obtain a respective second value from each of the first values, wherein all of the second values are integer values; and performing a global integer combine operation, using said second values, over the network.
 20. A method according to claim 19, wherein the performing step includes the step of performing a global unsigned integer sum over the network.
 21. A method according to claim 19, wherein the performing step includes the step of performing a global maximum operation over the network, and using results of said global maximum operation to identify the maximum of the first values.
 22. A method according to claim 19, wherein the performing step includes the step of performing a global maximum operation over the network, and using the results of said global maximum operation to identify the minimum of the first values.
 23. A system for performing an arithmetic function in a computer system having a multitude of nodes interconnected by a global tree network that supports integer combining operations, the system comprising: a group of nodes, each of the nodes of the group being provided with first values; a processor to process each of the first values, according to a first defined process, to obtain a respective second value from each of the first values, wherein all of the second values are integer values; and means for performing a global integer combine operation, using said second values, over the network.
 24. A system according to claim 23, wherein the means for performing includes means for performing a global unsigned integer sum over the network.
 25. A system according to claim 23, wherein the means for performing includes means for performing a global maximum operation over the network, and for using results of said global maximum operation to identify the maximum of the first values.
 26. A system according to claim 23, wherein the means for performing includes means for performing a global maximum operation over the network, and for using the results of said global maximum operation to identify the minimum of the first values.
 27. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for performing an arithmetic function in a computer system having a multitude of nodes interconnected by a global tree network that supports integer combining operations, the method steps comprising: providing each of a group of nodes with first values; processing each of the first values, according to a first defined process, to obtain a respective second value from each of the first values, wherein all of the second values are integer values; and performing a global integer combine operation, using said second values, over the network.
 28. A program storage device according to claim 27, wherein the performing step includes the step of performing a global unsigned integer sum over the network.
 29. A program storage device according to claim 27, wherein the performing step includes the step of performing a global maximum operation over the network, and using results of said global maximum operation to identify the maximum of the first values.
 30. A program storage device according to claim 27, wherein the performing step includes the step of performing a global maximum operation over the network, and using the results of said global maximum operation to identify the minimum of the first values.
 31. A method of performing a global operation on a computer system having a multitude of nodes interconnected by a global tree network that supports integer combining operations, the method comprising: providing each node with one or more numbers of any type; assembling the numbers of the nodes into an array, the array having a given number of positions, said assembling step including the steps of i) each node putting one or more of the numbers of the node into one or more of the positions of the array, and putting zero values into all of the other positions of the array, and ii) using the global tree network to sum all the numbers put into each position in the array.
 32. A method according to claim 31, wherein: the given number of positions of the array are arranged in a defined sequence; and the assembling step includes the further step of, each node establishing an associated array also having said given number of positions arranged in the defined sequence, and putting the one or more numbers of the node into one or more of the positions of the associated array, and putting zero values in all of the other positions of the associated array.
 33. A system for performing a global operation on a computer system having a multitude of nodes interconnected by a global tree network that supports integer combining operations, the system comprising: a group of nodes, each of the group of nodes having one or more numbers; means for assembling the numbers of the nodes into an array, the array having a given number of positions, said assembling means including i) means for putting one or more of the numbers of the nodes of the group into one or more of the positions of the array, and for putting zero values into all of the other positions of the array, and ii) means for using the global tree network to sum all the numbers put into each position in the array.
 34. A system according to claim 33, wherein: the given number of positions of the array are arranged in a defined sequence; and the means for assembling further includes means for establishing a respective one array associated with each of the nodes of the group and also having said given number of positions arranged in the defined sequence, for putting the one or more numbers of each of the nodes into one or more of the positions of the associated array, and for putting zero values in all of the other positions of the associated array.
 35. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for performing a global operation on a computer system having a multitude of nodes interconnected by a global tree network that supports integer combining operations, the method steps comprising: providing each node with one or more numbers; assembling the numbers of the nodes into an array, the array having a given number of positions, said assembling step including the steps of a. each node putting one or more of the numbers of the node into one or more of the positions of the array, and putting zero values into all of the other positions of the array, and b. using the global tree network to sum all the numbers put into each position in the array.
 36. A program storage device according to claim 35, wherein: the given number of positions of the array are arranged in a defined sequence; and the assembling step includes the further step of, each node establishing an associated array also having said given number of positions arranged in the defined sequence, and putting the one or more numbers of the node into one or more of the positions of the associated array, and putting zero values in all of the other positions of the associated array.
 37. A method of performing an arithmetic function in a computer system having a multitude of nodes interconnected by a global tree network that supports integer combining operations, the method comprising the steps of: each of the nodes contributing a set of first values; and performing a global integer combine operation, using said first values, over the network.
 38. A method according to claim 37, wherein the performing step includes the step of using results of the global integer combine operation to identify a characteristic of the first values.
 39. A system for performing an arithmetic function in a computer system having a multitude of nodes interconnected by a global tree network that supports integer combining operations, the system comprising: a group of nodes, each of the nodes of the group consisting of a set of first values; and a processor to perform a global integer combine operation, using the first values, over the network.
 40. A system according to claim 39, wherein the processor includes means for using results of the global integer combine operation to identify a characteristic of the first values.
 41. A method of operating a parallel processing computer system having a multitude of nodes interconnected by both a global tree network and a torus network, the method comprising: using the computer system to perform defined operations; and using both the torus and tree networks to cooperate on reduction operations.
 42. A method according to claim 41, wherein the step of using both the torus and tree networks includes the step of doing so by having one processor handle torus operations and another processor handle the tree operations.
 43. A method according to claim 41, wherein the step of using both the torus and tree networks includes the step of doing so by arranging the torus communications so that, in a three-dimensional torus, no node on the torus receives more than two packets to combine.
 44. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for operating a parallel processing computer system having a multitude of nodes interconnected by both a global tree network and a torus network, the method steps comprising: using the computer system to perform defined operations; and using both the torus and tree networks to cooperate on reduction operations.
 45. A program storage device according to claim 44, wherein the step of using both the torus and tree networks includes the step of doing so by having one processor handle torus operations and another processor handle the tree operations.
 46. A program storage device according to claim 44, wherein the step of using both the torus and tree networks includes the step of doing so by arranging the torus communications so that, in a three-dimensional torus, no node on the torus receives more than two packets to combine.
 47. A parallel processing computer system comprising: a multitude of nodes; a global tree network interconnecting the nodes; a torus network also interconnecting the nodes; and means for using both the torus and tree networks to cooperate on reduction operations.
 48. A computer system according to claim 47, wherein the means for using both the torus and tree networks includes one processor to handle torus operations and another processor to handle the tree operations.
 49. A computer system according to claim 47, wherein the means for using both the torus and tree networks includes means for arranging the torus communications so that, in a three-dimensional torus, no node on the torus receives more than two packets to combine.
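
By way of illustration only, and without limiting the claims, the array-assembly step recited in claims 31 to 36 can be sketched as follows. The sketch assumes that each node contributes one 64-bit floating-point number, that the number is reinterpreted as an unsigned 64-bit integer before combining, and that the tree network's global integer sum can be modeled by an element-wise sum; the helper names `float_to_bits`, `bits_to_float`, `tree_integer_sum` and `assemble` are hypothetical.

```python
import struct


def float_to_bits(x):
    """Reinterpret a 64-bit float as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_float(b):
    """Reinterpret an unsigned 64-bit integer as a 64-bit float."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def tree_integer_sum(arrays):
    """Stand-in for the network's global integer sum (element-wise)."""
    return [sum(column) for column in zip(*arrays)]


def assemble(node_values):
    """Each node fills its own position and zeros the others; then sum."""
    n = len(node_values)
    contributions = []
    for rank, value in enumerate(node_values):
        arr = [0] * n                     # zero in every other position
        arr[rank] = float_to_bits(value)  # this node's number, as an integer
        contributions.append(arr)
    summed = tree_integer_sum(contributions)
    return [bits_to_float(b) for b in summed]


print(assemble([1.5, -2.25, 3.0e10, 0.1]))
```

Because every position receives exactly one nonzero contribution, the integer sum returns each bit pattern unchanged, so every node ends up with the same assembled array regardless of the type of the original numbers.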